Version: 2.8

4 Data Generation

Before going into detail, here is an overview of the data generation process.

To set up and build a project (i.e. a group of data you would like to anonymize), follow these steps. See evl datagen command for details about the ‘evl datagen’ commands.

  1. Create a new project

    evl datagen project new <project_dir>

    See Project for details about projects.

  2. Add a source, i.e. a folder with files to be anonymized or a database with tables to be anonymized:

    evl datagen source new <source_name> \
    --guess-from-csv <path_to_folder_with_such_CSVs>

    See Source Settings for details about settings for a source.

  3. Edit the resulting config (csv) file according to your preferences. (An Excel config file checks validity immediately and provides drop-down options.)

  4. Check the config file for mistakes

    evl datagen check <config_file>

  5. Generate anonymization jobs and workflow

    evl datagen build <config_file>

    See Build and Run for details about job and workflow generation, and Config File for details about the config file.

Then, each time you want to anonymize the data, run the anonymization jobs:

evl run/datagen/<table_1>.evl
evl run/datagen/<file_1>.evl
...

Each job represents one file or table to be anonymized. See Build and Run for details.

Note: Be careful when running anonymization jobs several times. By default, data in the target are overwritten, unless ‘export EVL_DATAGEN_APPEND=1’ is specified in the settings file configs/datagen/*.sh or in project.sh.

See Environment variables for details about all possible configuration EVL_DATAGEN_* variables.

If you have many files or tables to anonymize in one batch, you don’t need to run the anonymization jobs one after another; you can run all of them through the generated workflow:

evl run workflow/datagen/<source_name>.ewf
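The overview steps above can be chained into one script. The following is a minimal sketch and not part of EVL itself: my_project, my_source and data/source are placeholder names, the run helper and the DRY_RUN switch are hypothetical conveniences introduced here, and the config file path configs/<source_name>.csv is assumed from the Source Settings section. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: chains the overview steps. All names are placeholders.
PROJECT_DIR="${PROJECT_DIR:-my_project}"
SOURCE_NAME="${SOURCE_NAME:-my_source}"
CSV_DIR="${CSV_DIR:-data/source}"   # resolved after cd into the project
DRY_RUN="${DRY_RUN:-1}"             # 1 = print commands only, 0 = execute

# Hypothetical helper: print a command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" || exit 1
    fi
}

run evl datagen project new "$PROJECT_DIR"
run cd "$PROJECT_DIR"
run evl datagen source new "$SOURCE_NAME" --guess-from-csv="$CSV_DIR"
# In practice, review and edit configs/$SOURCE_NAME.csv before going on.
run evl datagen check "configs/$SOURCE_NAME.csv"
run evl datagen build "configs/$SOURCE_NAME.csv"
run evl run "workflow/datagen/$SOURCE_NAME.ewf"
```

Running it with DRY_RUN=0 executes the commands and stops at the first one that fails. The manual editing step between creating the source and checking the config is not automated here.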

4.1 evl datagen command

(since EVL 1.0)

To help generate, check and build all the configuration files, there is the ‘evl datagen’ command line utility.

evl datagen project new <project_dir>
creates a new project folder <project_dir> with the default folder structure and files inside.

evl datagen project sample <project_dir>
creates a new project folder <project_dir> with sample data and configs.

evl datagen source new <source_name>
creates a new source <source_name> in the current project directory (or in <project_dir>). With the ‘--guess-from-csv’ option, it guesses data types based on the source csv files.

evl datagen check <config_file>
checks whether <config_file> contains a valid combination of metadata.

evl datagen build <config_file>
generates data-generation jobs based on <config_file> and also a Workflow with all these jobs.

Synopsis

evl datagen project
( new | sample ) <project_dir>
[-v|--verbose]

evl datagen source new
<source_name>
[-p|--project <project_dir>]
[-g|--guess-from-csv <source_dir>]
[-v|--verbose]

evl datagen check
<config_file>
[-p|--project <project_dir>]
[-v|--verbose]

evl datagen build
<config_file>
[-p|--project <project_dir>]
[--parallel [<parallel_threads>]]
[-v|--verbose]

evl datagen
( --help | --usage | --version )

Options

-p, --project=<project_dir>
if the current directory is not the project directory, a full or relative path to the project can be specified by <project_dir>

--parallel[=<parallel_threads>]
generate the workflow with jobs parallelized as much as possible. To limit the parallelization, specify <parallel_threads>, the maximum number of jobs that can run in parallel.

-g, --guess-from-csv=<source_dir>
guess data types based on the source csv files in <source_dir>

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit
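As an illustration of the options above, the following sketch builds jobs for a project located elsewhere and caps the generated workflow at four parallel jobs. The paths and the build_limited function name are hypothetical, and whether <config_file> is resolved relative to the project or to the current directory is not stated here, so verify it for your setup.

```shell
#!/bin/sh
# Sketch only: placeholder paths, adjust to your project.
PROJECT_DIR="${PROJECT_DIR:-$HOME/my_project}"
CONFIG_FILE="${CONFIG_FILE:-configs/my_source.csv}"
MAX_PARALLEL="${MAX_PARALLEL:-4}"   # upper bound on jobs running at once

# Hypothetical wrapper around the documented build command.
build_limited() {
    evl datagen build "$CONFIG_FILE" \
        --project="$PROJECT_DIR" \
        --parallel="$MAX_PARALLEL" \
        --verbose
}

# Run only where the evl command is available.
if command -v evl >/dev/null 2>&1; then
    build_limited
fi
```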

Environment Variables

The list of all EVL Data Generation variables with their default values. You can change these values in your ‘~/.evlrc’ file or in the project’s ‘project.sh’.

EVL_DATAGEN_APPEND=0
whether to append to or overwrite target files/tables. Possible values are ‘0’ (overwrite) or ‘1’ (append).

EVL_DATAGEN_EOL=""
which end-of-line is used: Linux (‘\n’), Windows (‘\r\n’) or old Mac (‘\r’). Possible values are "dos", "mac", or leave empty for Linux EOL.

EVL_DATAGEN_HEADER=1
how many lines the file header has. Zero means no header.

EVL_CONFIG_EOL=""
which end-of-line is used for the main config CSV file: Linux (‘\n’), Windows (‘\r\n’) or old Mac (‘\r’). Possible values are "dos", "mac", or leave empty for Linux EOL.

EVL_CONFIG_FIELD_SEPARATOR=";"
the default field separator used in config files

EVL_DEFAULT_FIELD_SEPARATOR=";"
the default field separator for CSV files. It can be any of the first 128 ASCII characters.

EVL_DEFAULT_RECORD_SEPARATOR='\n'
the default record separator for CSV files. It can be any of the first 128 ASCII characters. By default a Linux newline is used. To use the Windows end of line (i.e. ‘\r\n’), use the ‘EVL_DATAGEN_EOL’ variable.
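For example, a project whose files use Windows end-of-lines and whose target should be appended to rather than overwritten might carry settings like the following in ‘project.sh’. This is an illustrative fragment that uses only variables listed above; the chosen values describe a hypothetical project, and the ‘export’ form follows the Note in the overview.

```shell
# project.sh (fragment): example values for a hypothetical project
export EVL_DATAGEN_APPEND=1               # append to target instead of overwriting
export EVL_DATAGEN_EOL="dos"              # Windows end-of-lines (\r\n)
export EVL_DATAGEN_HEADER=1               # one header line per file
export EVL_DEFAULT_FIELD_SEPARATOR=","    # comma instead of the default ";"
```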


4.2 Project

Think of an anonymization project as a folder in which we work on the anonymization of some group of data, for example data grouped from a business point of view. In most cases there will be only one or a few projects.

You can create a new project by hand or by a command:

evl datagen project new my_project

It will create a new directory my_project in the current folder with default settings and subfolder structure.

Or you can create a new project with sample data and configuration:

evl datagen project sample $HOME/my_sample_project

It will create a new directory my_sample_project in your home folder with a sample project.

The anonymization project directory structure is:

build/
files generated by ‘evl datagen build’ command

configs/
configuration csv files and settings sh files

lib/
folder for custom anonymization functions

run/
anonymization jobs generated by ‘evl datagen build’ command

workflow/
workflows generated by ‘evl datagen build’ command

All files in the build, run and workflow directories are generated entirely from the configuration file(s) configs/<source_name>.csv.
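A quick way to confirm that a directory has the layout listed above is to test for each subfolder. This is a small sketch; the function name check_project_layout is hypothetical and not an EVL command, and folders filled by ‘evl datagen build’ may be absent or empty before the first build.

```shell
#!/bin/sh
# Hypothetical helper: report which expected project subfolders are missing.
check_project_layout() {
    project_dir=${1:-.}
    missing=0
    for sub in build configs lib run workflow; do
        if [ ! -d "$project_dir/$sub" ]; then
            echo "missing: $project_dir/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For instance, `check_project_layout my_project && echo "layout looks complete"` prints the message only when all five subfolders exist.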


4.3 Source Settings

Once we have a project directory, the next step is to add a source, which can be a folder with files or a database.

What should be anonymized, and how, is specified in a config file and a settings file. The config file can be a csv file; the settings file is a shell script with variable definitions.

Each source has one config file and one settings file.

To create new, empty config and settings files, run:

evl datagen source new my_source

which creates two files in the current project folder:

configs/my_source.csv
configs/datagen/my_source.sh

To create pre-filled config and settings files based on a folder with source csv files:

evl datagen source new my_source --guess-from-csv=data/source

which goes through all csv files in the data/source folder and fills in the config file with entity names (i.e. file names), field names based on headers, data types and the null flag of each field.

If the current directory is not the project’s one, specify the path to the project by option ‘--project=<project_path>’.

See Config File for detailed information about config files.
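Before running ‘evl datagen source new’ with ‘--guess-from-csv’, it can help to preview what the source folder contains. The sketch below prints, for each csv file, the file name without extension and the fields of its first line. It only approximates the inputs described above and is not EVL's guessing logic; it assumes one header line, the default ‘;’ separator, and that the entity name is the file name without its extension. The function name preview_csv_headers is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<entity>: <field> <field> ..." per CSV file.
preview_csv_headers() {
    source_dir=${1:-data/source}
    sep=${2:-";"}
    for file in "$source_dir"/*.csv; do
        [ -f "$file" ] || continue           # skip when nothing matches
        entity=$(basename "$file" .csv)
        # First line only; strip a trailing \r; turn separators into spaces.
        fields=$(head -n 1 "$file" | tr -d '\r' | tr "$sep" ' ')
        echo "$entity: $fields"
    done
}
```

Calling `preview_csv_headers data/source` from the project directory lists one line per csv file, which makes it easier to compare against the generated configs/<source_name>.csv afterwards.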


4.4 Build and Run

For each Entity from the config file, i.e. table or file, an anonymization job with mapping and other metadata needs to be built. It is enough to run the command line utility:

evl datagen build <config_file>
[-p|--project <project_dir>]
[--parallel [<parallel_threads>]]
[-v|--verbose]

That builds all the files in the build/ project subdirectory. There you can find evd and evm files in the appropriate folders. EVD means EVL Data definition file; it defines the structure of the source/target: field names, data types and other attributes. EVM means EVL Mapping file; it defines how each field is mapped. Although both kinds of file are generated, it is sometimes useful to check what they look like for debugging purposes.

It also generates files in the run/datagen/ subdirectory, one evl file per Entity. These files can then be run to anonymize the data. For example, for three tables party_addr, party_cont and party_rel, the jobs are fired by these commands:

evl run/datagen/party_addr.evl
evl run/datagen/party_cont.evl
evl run/datagen/party_rel.evl

Once such an evl file exists for an Entity, there is no need to build the jobs again. On each run, the job checks whether the config file has changed for the given Entity and, if so, runs the ‘evl datagen build’ command automatically.

Note: There is no need to run ‘evl datagen build’ every time the config file is updated. It is done automatically once the job is fired.
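When only a few Entities are involved, the per-Entity jobs shown above can also be run one after another in a loop. This is a minimal sketch, assuming it is started from the project directory and that evl returns a non-zero exit status when a job fails (an assumption, not stated in this section); the function name run_all_jobs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run every generated job in run/datagen/ sequentially.
run_all_jobs() {
    jobs_dir=${1:-run/datagen}
    for job in "$jobs_dir"/*.evl; do
        [ -f "$job" ] || continue      # no jobs generated yet
        echo "running $job"
        if ! evl "$job"; then
            echo "job failed: $job" >&2
            return 1                   # stop at the first failure
        fi
    done
}
```

Keep in mind that rerunning jobs overwrites the target by default, so a loop like this is best used with the EVL_DATAGEN_APPEND setting chosen deliberately.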

The build command also generates a workflow file for the given source in the workflow/datagen/ subdirectory. With it you can run the anonymization for all the Entities from that source. For example, for a source defined by configs/some_source.csv, you can run

evl run workflow/datagen/some_source.ewf

and it will run all anonymization jobs in one or several parallel threads, depending on the value given by the --parallel option of ‘evl datagen build’.

If one or more anonymization jobs in a workflow fail, you can restart the whole workflow by:

evl restart workflow/datagen/some_source.ewf

or continue from those last failures:

evl continue workflow/datagen/some_source.ewf
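The run-then-continue pattern can be scripted. The sketch below runs the workflow and, if that fails, tries ‘evl continue’ a limited number of times. It assumes that ‘evl run’ and ‘evl continue’ return a non-zero exit status on failure, which this section does not state; run_with_continue and its retry limit are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a workflow, then continue from failures if needed.
run_with_continue() {
    workflow=$1
    max_continues=${2:-2}
    evl run "$workflow" && return 0
    attempt=1
    while [ "$attempt" -le "$max_continues" ]; do
        echo "workflow failed, continue attempt $attempt" >&2
        evl continue "$workflow" && return 0
        attempt=$((attempt + 1))
    done
    return 1
}
```

A call such as `run_with_continue workflow/datagen/some_source.ewf` suits transient failures; if a job fails for a persistent reason, fix the cause before continuing rather than relying on retries.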